Project Introduction

For this project, we are investigating the times in the day when collisions occur using military time, and we will also be exploring any correlations between weather and the severity of the accident. Furthermore, we will see who causes the most accidents locals or out-of-towners and which agencies had the most complete reports.

Project Data

The dataset we are using describes individual collision incidences that happened on county, state, and local roadways within the greater Montgomery County of Maryland. The data collected is from the Automated Crash Reporting System by many agencies such as Montgomery County Police, Gaithersburg Police and more. Within the dataset, many different details are given regarding individual collisions, including the longitudes and latitudes of where a collision happened, the date and time of a collision, and even the street types. The data set also provides information on outside factors such as weather, surface conditions, lighting, and nearby traffic controls. The information about the drivers associated with a collision was also recorded and this data includes the driver license state, if there is evidence of substance abuse, if a driver was distracted, if a driver was at fault, and injury severity. Other factors that were published include vehicle damage, vehicle movement, and the type of vehicle. Some of the information that will be used for this project includes the location, latitude, longitude, the driver’s license state, injury severity, and weather.

Data Source: https://data.montgomerycountymd.gov/Public-Safety/Crash-Reporting-Drivers-Data/mmzv-x632

Initial data exploration

It should be noted that running the complete.cases() function and counting the total number of complete rows showed that all but one row was complete. When investigating that row in the data table, there was no evidence of missing data, except for where an empty string was entered into a cell. It was decided that that particular row could contain important information for some visualizations, so all rows were kept. This left us with a total of 65,841 rows of data to investigate.

There are 32 columns of report information that included information like longitude and latitude of a collision, if any injuries were associated with a collision, and even information on the weather during the time of the collision. There are many possibile angles for investigating this data to find some interesting information.

After using the summary() function, it was easy to see which columns of report information might be useful for investigation and which might not be so useful. The columns labeled “Person.ID” and “Vehicle.ID” had all unique values, so this data would not be helpful for generating informative visualizations. There are also a great number of “N/A”, “Other”, and “Unknown” values that may not be helpful for some visualizations. So there is the need to be aware of when to filter for those values depending on the graphs we hope to create.

By running the str() function on the data, it is learned that much for the data is in factor form. Knowing this will help when creating new data frames for visualizations.

It should also be noted that this data includes collisions beginning January 1, 2015 through March 8, 2018.

Data cleaning and preprocessing

This project relied heavily on categorical data so it was of the utmost importance to determine which categories to include in visualizations and which to exclude. The questionable categories were empty strings, “N/A”, “Other”, and “Unknown” values so there is definitely the need to do some data cleaning depending on the desired visual output. For example, these exact values were discarded when mapping the locations of collisions resulting in injuries, property damage, or fatalities.

Empty Strings

> percent.empty = apply(dat, 2, function(x) (mean(na.omit(x) == "")))
> percent.empty.asc = percent.empty[order(percent.empty, decreasing = TRUE)]
> par(mar = c(4,13,2,1))
> barplot(percent.empty.asc, horiz = TRUE, las = 2, col = "firebrick", main = "Percent Empty Strings per Column",
+         xlab = "Percent Empty Rows", xlim = c(0, 1), cex.names = 0.75, panel.first = grid(nx = NA, ny = NULL))

There are a few data columns that have a much higher percentage of empty string input. This information is important for the barplot created to further investigate the agencies that input collision records with missing data. We wanted to avoid considering columns that are rarely needed for collision reporting. If a column of data was more than 90% empty then that column was discarded from the pool of data for further investigation as those columns were likely filled-in only for special cases, as is the case with columns “Related.Non.Motorist”, “Non.Motorist.Substance.Abuse”, and “Off.Road.Description”. For determining percentage of cells per row that were filled in, only empty strings were considered to be “not filled-in.” “N/A”, “Unknown”, and “Other” entries were considered to be thoughtfully entered values as agents took the time to fill those values into the reports instead of leaving a blank.

“N/A” Strings

> percent.na = apply(dat,2,function(x) round(mean(na.omit(x) == "N/A"),2))
> percent.na.asc = percent.na[order(percent.na, decreasing = TRUE)]
> par(mar = c(4,13,2,1))
> barplot(percent.na.asc, horiz = TRUE, las = 2, col = "firebrick", main = "Percent N/A Strings per Column",
+         xlab = "Percent Rows per Column", xlim = c(0, 1), cex.names = 0.75, panel.first = grid(nx = NA, ny = NULL))

“OTHER” Strings

> percent.other = apply(dat,2,function(x) round(mean(na.omit(x) == "OTHER"),2))
> percent.other.asc = percent.other[order(percent.other, decreasing = TRUE)]
> par(mar = c(4,13,2,1))
> barplot(percent.other.asc, horiz = TRUE, las = 2, col = "firebrick", main = "Percent OTHER Strings per Column",
+         xlab = "Percent Rows per Column", xlim = c(0, 1), cex.names = 0.75, panel.first = grid(nx = NA, ny = NULL))

“UNKNOWN” Strings

> percent.unknown = apply(dat,2,function(x) round(mean(na.omit(x) == "UNKNOWN"),2))
> percent.unknown.asc = percent.unknown[order(percent.unknown, decreasing = TRUE)]
> par(mar = c(4,13,2,1))
> barplot(percent.unknown.asc, horiz = TRUE, las = 2, col = "firebrick", main = "Percent UNKNOWN Strings per Column",
+         xlab = "Percent Rows per Column", xlim = c(0, 1), cex.names = 0.75, panel.first = grid(nx = NA, ny = NULL))

There was the need to decide when to keep “OTHER”, “N/A”, and “UNKOWN” values for visualizations and when these values should be removed. In most cases, these values were discarded completely. This is the case with the visualizations for collision locations, investigating correlations between collision locations, and visualizing the types of accidents occuring for each hour of the day.

For some graphs, “OTHER”, “N/A”, and “UNKOWN” values were important as they could indicate information regarding why a more specific value was not reported. This can be seen in the graph demonstrating how the weather may effect the severity of a collision and if the driver was at fault. In this case, “OTHER”, “N/A”, and “UNKOWN” values were important as their use in the report could signify that the weather was not “out of the ordinary.”

Ultimately, it was necessary to remain vigilante in determining when to discard certain values with the hopes of creating more fair and informative visualizations of the crash data.

Data exploration and visualization

This data exploration began with learning the times during which most accidents occur. By creating a bar graph of the number of accidents per hour of the day, it would be easier to group times of accidents in a more fair manner. The following barplot shows which hours have the highest numbers of accidents and which have the least.

We can now see the hours during which the highest amount of accidents occur and the hours during which the least number of accidents take place. The hours with the highest numbers of accidents corresponds to the rush hours attributed to commutes to and from work.

> # Determine rush hour times
> # get Crash.Date.Time in easy-to-use form:
> date.time = strptime(as.character(dat$Crash.Date.Time), "%m/%d/%Y %I:%M:%S %p", tz = "America/New_York")
> 
> # get just the hours from our usable dat.time:
> hours = date.time$hour + 1
> 
> #create barplot of crashes per hour to determine hours of highest number of crashes
> barplot(table(hours), main = "Number of Crashes per Hour of the Day", col = "firebrick")

It can be observed that the rush hour times are from 8:00 am - 10:00 am and from 4:00 pm - 7:00 pm.

This has led to following time groupings:

Time Group 24-Hour Interval 12-Hour Interval
Rush Hour 8:00 - 10:00, 16:00 - 19:00 8:00 am - 10:00 am, 4:00 pm - 7:00 pm
School Hours 11:00 - 15:00 11:00 am - 3:00 pm
Night Hours 20:00 - 22:00 8:00 pm - 10:00 pm
Late Night Hours 23:00 - 3:00 11:00 pm - 3:00 am
Early Morning Hours 4:00 - 7:00 4:00 am - 7:00 am

Note: The time groupings are in a 24-hour clock system and the corresponding 12-hour clock times are also shown.

Diving deeper into groupings of accidents by time, the time groups were used to determine accident locations throughout Montgomery County. By doing this, we can find accident locations that seem interesting and may warrant further investigation. A particular location of interest was determined to be in the Downtown Silver Spring area. This location had high number of accidents throughout all times of the day, indicating that it may be an area that local civil engineering departments may need to investigate for hazardous conditions.

The following scatter plots demonstrate accident locations using their longitudes and latitudes. The accident locations follow the Montgomery County highways and street systems, creating a map of the county that can be compared to an actual map which can help in identifying locations.

Below each scatter plot of the accidents in the greater Montgomery County, there is a corresponding scatter plot that is zoomed in on the Downtown Silver Spring area for a more detailed view of accident locations and their times.

Accidents during Rush Hour

> #Create DF with Longitude, Latitude, hour of accident, collision type.
> #This DF will be used for many plots
> accident.times = data.frame(Longitude=dat$Longitude, Latitude=dat$Latitude, hour=strptime(as.character(dat$Crash.Date.Time), "%m/%d/%Y %I:%M:%S %p", tz = "America/New_York")$hour + 1, dat$Collision.Type)
> 
> # Create DF with accident.time as whole hour, longitude, latitude, collision type only for rush-hour times:
> rush.hours = c(8, 9, 10, 16, 17, 18, 19)
> accident.rush.hour = accident.times[is.element(accident.times$hour ,rush.hours),]
> 
> # create a color vector:
> rush.colors = c("chocolate","darkorchid", "goldenrod4", "red3", "darkslateblue","firebrick1","firebrick4")
> names(rush.colors) = c(8, 9, 10, 16, 17, 18, 19)
> 
> # Plot accidents:
> plot(accident.rush.hour$Longitude, accident.rush.hour$Latitude, xlim = c(-77.3, -76.92), ylim = c(38.94, 39.2), pch = ".", col=rush.colors[as.character(accident.times$hour)], main = "Montgomery County Accident Locations: Rush Hour", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = rush.hours,col=rush.colors[as.character(rush.hours)],pch=19, cex=0.7)

> # Plot Close-up of Silver Spring: 
> plot(accident.rush.hour$Longitude, accident.rush.hour$Latitude, xlim = c(-77.0175, -77.035), ylim = c(38.98, 39.005), pch = 20, col=rush.colors[as.character(accident.times$hour)], main = "Downtown Silver Spring Accident Locations: Rush Hour", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = rush.hours,col=rush.colors[as.character(rush.hours)],pch=19, cex=0.7)

Accidents during School Hours

> # View map of accidents during school hours
> 
> # Create DF with accident.time as whole hour, longitude, latitude, collision type only for school-hour times:
> school.hours = c(11, 12, 13, 14, 15)
> accident.school.hours = accident.times[is.element(accident.times$hour ,school.hours),]
> 
> # create a color vector:
> school.hours.colors = c("blue", "deepskyblue", "purple", "springgreen","orchid3")
> names(school.hours.colors) = c(11, 12, 13, 14, 15)
> 
> #Plot accidents:
> plot(accident.school.hours$Longitude, accident.school.hours$Latitude, xlim = c(-77.3, -76.92), ylim = c(38.94, 39.2), pch = ".", col=school.hours.colors[as.character(accident.school.hours$hour)], main = "Montgomery County Accident Locations: School Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = school.hours,col=school.hours.colors[as.character(school.hours)],pch=19, cex=0.7)

> # Plot close-up of Silver Spring
> plot(accident.school.hours$Longitude, accident.school.hours$Latitude, xlim = c(-77.0175, -77.035), ylim = c(38.98, 39.005), pch = 20, col=school.hours.colors[as.character(accident.school.hours$hour)], main = "Downtown Silver Spring Accident Locations: School Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = school.hours,col=school.hours.colors[as.character(school.hours)],pch=19, cex=0.7)

Accidents during Night Hours

> # Create DF with accident.time as whole hour, longitude, latitude, collision type only for latenight-hour times:
> night.hours = c(20, 21, 22)
> accident.night = accident.times[is.element(accident.times$hour ,night.hours),]
> 
> # create a color vector:
> night.hours.colors = c("mediumturquoise","mediumvioletred", "midnightblue")
> names(night.hours.colors) = c(20, 21, 22)
> 
> # Plot accidents;
> plot(accident.night$Longitude, accident.night$Latitude, xlim = c(-77.3, -76.92), ylim = c(38.94, 39.2), pch = ".", col=night.hours.colors[as.character(accident.night$hour)], main="Montgomery County Accident Locations: Night Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = night.hours,col=night.hours.colors[as.character(night.hours)],pch=19, cex=0.7)

> # Plot close-up of Silver Spring:
> plot(accident.night$Longitude, accident.night$Latitude, xlim = c(-77.0175, -77.035), ylim = c(38.98, 39.005), pch = 20, col=night.hours.colors[as.character(accident.night$hour)], main="Downtown Silver Spring Accident Locations: Night Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = night.hours,col=night.hours.colors[as.character(night.hours)],pch=19, cex=0.7)

Accidents during Late Night Hours

> # Create DF with accident.time as whole hour, longitude, latitude, collision type only for latenight-hour times:
> latenight.hours = c(1, 2, 3, 23, 24)
> accident.late.night = accident.times[is.element(accident.times$hour ,latenight.hours),]
> 
> # create a color vector:
> late.night.colors = c("blue","purple", "springgreen", "orangered", "yellow2")
> names(late.night.colors) = c(1, 2, 3, 23, 24)
> 
> # Plot accidents:
> plot(accident.late.night$Longitude, accident.late.night$Latitude, xlim = c(-77.3, -76.92), ylim = c(38.94, 39.2), pch = ".", col=late.night.colors[as.character(accident.late.night$hour)], main="Montgomery County Accident Locations: Late Night Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = latenight.hours,col=late.night.colors[as.character(latenight.hours)],pch=19, cex=0.7)

> # Plot close-up of Silver Spring:
> plot(accident.late.night$Longitude, accident.late.night$Latitude, xlim = c(-77.0175, -77.035), ylim = c(38.98, 39.005), pch = 20, col=late.night.colors[as.character(accident.late.night$hour)], main="Downtown Silver Spring Accident Locations: late Night Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = latenight.hours,col=late.night.colors[as.character(latenight.hours)],pch=19, cex=0.7)

Accidents during Early Morning Hours

> # Create DF with accident.time as whole hour, longitude, latitude, collision type only for rush-hour times:
> morning.hours = c(4, 5, 6, 7)
> accident.early.morning = accident.times[is.element(accident.times$hour ,morning.hours),]
> 
> # create a color vector:
> colors = c("blue","purple", "springgreen", "orangered")
> names(colors) = c(4, 5, 6, 7)
> 
> # Plot accidents:
> plot(accident.early.morning$Longitude, accident.early.morning$Latitude, xlim = c(-77.3, -76.92), ylim = c(38.94, 39.2), pch = ".", col=colors[as.character(accident.early.morning$hour)], main="Montgomery County Accident Locations: Early Morning Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = morning.hours,col=colors[as.character(morning.hours)],pch=19, cex=0.7)

> # Plot close-up of Silver Spring:
> plot(accident.early.morning$Longitude, accident.early.morning$Latitude, xlim = c(-77.0175, -77.035), ylim = c(38.98, 39.005), pch = 20, col=colors[as.character(accident.early.morning$hour)], main="Downtown Silver Spring Accident Locations: Early Morning Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = morning.hours,col=colors[as.character(morning.hours)],pch=19, cex=0.7)

Accidents Comparison of Downtown Silver Spring

> layout(matrix(c(1,1,2,2,3,3,
+                 1,1,2,2,3,3,
+                 4,4,5,5,0,0,
+                 4,4,5,5,0,0), nrow = 4, ncol = 6, byrow = TRUE))
> 
> plot(accident.rush.hour$Longitude, accident.rush.hour$Latitude, xlim = c(-77.02, -77.035), ylim = c(38.98, 39.005), pch = 20, col=rush.colors[as.character(accident.times$hour)], main = "Rush Hour", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = rush.hours,col=rush.colors[as.character(rush.hours)],pch=19, cex=0.7)
> 
> plot(accident.school.hours$Longitude, accident.school.hours$Latitude, xlim = c(-77.02, -77.035), ylim = c(38.98, 39.005), pch = 20, col=school.hours.colors[as.character(accident.school.hours$hour)], main = "School Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = school.hours,col=school.hours.colors[as.character(school.hours)],pch=19, cex=0.7)
> 
> plot(accident.night$Longitude, accident.night$Latitude, xlim = c(-77.02, -77.035), ylim = c(38.98, 39.005), pch = 20, col=night.hours.colors[as.character(accident.night$hour)], main="Night Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = night.hours,col=night.hours.colors[as.character(night.hours)],pch=19, cex=0.7)
> 
> plot(accident.late.night$Longitude, accident.late.night$Latitude, xlim = c(-77.02, -77.035), ylim = c(38.98, 39.005), pch = 20, col=late.night.colors[as.character(accident.late.night$hour)], main="late Night Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = latenight.hours,col=late.night.colors[as.character(latenight.hours)],pch=19, cex=0.7)
> 
> plot(accident.early.morning$Longitude, accident.early.morning$Latitude, xlim = c(-77.02, -77.035), ylim = c(38.98, 39.005), pch = 20, col=colors[as.character(accident.early.morning$hour)], main="Early Morning Hours", ylab = "Latitude", xlab = "Longitude")
> legend('topright',legend = morning.hours,col=colors[as.character(morning.hours)],pch=19, cex=0.7)

A side-by-side comparison of the plots showing accidents in the Silver Spring area show that there are many accidents occuring during all hours of the day. What is striking about this comparison is the number of accidents that occured from Latitudes 39.985 - 39.995 and Longitudes -77.024 - -77.0275 during the late night and early morning hours of 3:00 - 4:00. This can be attributed to accidents influenced by alcohol or other substances. When investigating the local nightlife scene through Yelp, it was discovered that local bar laws require that all bars acquire about 50% of their revenue from food sales. So, there are no legal establishments that sell only alcohol, possibly decreasing the likelihood of excessive drinking. Furthermore, many of the downtown Silver Spring restaurants and bars close between 11:00 pm - 12:00 am. Some close later, around 2:00 am and only one was found to close at 3:00 pm. This could mean that there are a few very late-night bar hoppers that do end up driving and getting into the late night accidents reported in this data. This may indicate that there is a need for increased enforcement for the prevention of driving under the influence in the downtown Silver Spring area.

In the case that substances are not influencing these accidents, then the darkness of these late hours may be obstructing the views of drivers in these specific areas. This should send an alert to local governments that improved lighting on these roads should be added to clearer driving conditions.

Upon looking at the data for the 3:00 am and 4:00 am times, there were certainly some that involved alcohol abuse, but most did not. This is a area that would definitely need further investigation by the local city planners.

Types of Accidents

Diving deeper into the investigations of accidents by hours, the next graph demonstrates the types of accidents that occur per hour of the day.

> accident.times = data.frame(Longitude=dat$Longitude, Latitude=dat$Latitude, hour=strptime(as.character(dat$Crash.Date.Time), "%m/%d/%Y %I:%M:%S %p", tz = "America/New_York")$hour + 1, dat$Collision.Type)
> #1: ratio of colision types per hour
> #table showing num accidents depend on hour/collision type
> hr.table=table(accident.times$hour, accident.times$dat.Collision.Type)
> 
> # clean data of UNKNOWN, OTHER, N/A values:
> cols = c("ANGLE MEETS LEFT HEAD ON","ANGLE MEETS LEFT TURN", "ANGLE MEETS RIGHT TURN","HEAD ON","HEAD ON LEFT TURN","OPPOSITE DIR BOTH LEFT TURN","OPPOSITE DIRECTION SIDESWIPE",                       
+           "SAME DIR BOTH LEFT TURN","SAME DIR REAR END", "SAME DIR REND LEFT TURN","SAME DIR REND RIGHT TURN","SAME DIRECTION LEFT TURN","SAME DIRECTION RIGHT TURN",   
+           "SAME DIRECTION SIDESWIPE","SINGLE VEHICLE","STRAIGHT MOVEMENT ANGLE")
> plot(hr.table[,cols], las=1, col=rainbow(16), main = "Ratios of Collisions per Hour of the Day")
> legend('topleft',legend = cols,col=rainbow(16),pch=19, cex=0.6, pt.cex = 1, y.intersp = 1.5)

This graph shows the ratios of different types of accidents on the y-axis and the ratios of number of accidents across the hours of the day on the x-axis. We can see that the bulk of the accidents occur during the commuter rush hours, indicated by the wider bar sizes. The least number of accidents occur during the late night hours, indicated by the slimmer bar widths. The heights of the bars in each column represent the ratios of types of accidents. Taller bars indicate a greater number of accidents, while slim lines indicate the number of accidents of the corresponding type is relatively low. The legend shows the color code for accident types and bars in the graph.

The most represented type of accident that occurs during rush-hour is “Same direction rear-end.” In contrast, the type of accident that occurs the most during the late night hours is due to “Single vehicle.” This is expected as there are more cars on the roads during rush-hour, increasing chances of collisions between vehicles. While the smaller number of cars on the road during the late night hours decrease the likelihood of collisions between vehicles, but increase collisions involving only one car.

A particularly interesting hour is the 7:00 am hour. The number of accidents occuring at 7:00 am was much less than the rush-hour times, but much greater than the early morning times. This transitional period leads to some interesting qualities. One particular aree of interest is in the large increase in the ratio “Same direction rear-end” accidents from 6:00 am. This increase is likely due to the increasing number of vehicles on the roads. Another interesting aspect is the ratio of “Head-on left turn” accidents compared to the 8:00 am hour, when the ratio decreases. Perhaps 7:00 am hour drivers are taking more risks to catch lights than their 8:00 am hour counterparts.

Weather Conditions and Collisions

> par(mar=c(2,3,1,3))
> plot(Latitude ~ Longitude, data=dat, ylim = c(38.95, 39.33), xlim = c(-77.5,-76.9), pch=".", col="grey")
> len = length(sort(table(dat$Weather), decreasing = TRUE))
> vec = rainbow(len)
> names = names(sort(table(dat$Weather), decreasing = TRUE))
> remove = c("N/A", "OTHER", "UNKNOWN")
> names = setdiff(names, remove)
> for(i in 1:len){
+   points(Latitude ~ Longitude, data=dat[dat$Weather==names[i],], col=vec[i], pch=".")
+ }
> legend('topleft', title="Weather Conditions",legend = names, col=vec, pch=19, cex=0.6, pt.cex = 1, y.intersp = 1.5)

For every collision accident what occurred in the county, the graph pinpoints what weather condition was present when it happened. The figure above shows a zoomed up map view of the accidents from the latitude range of 38.95 to 39.33 and the longitude range of -77.5 to -76.9. It appears that most of the accidents happened within the county and not in the outskirt of the county. The graph is limited and does not show N/A, OTHER, UNKNOWN because these three cannot be categorized to the other weather conditions. The graph shows that the majority of the collisions happen downtown or in the town and not on the outskirts of the county.

Locations of Fatal Crashes

> par(mar=c(2,3,1,3))
> plot(Latitude ~ Longitude, data=dat, ylim = c(38.95, 39.33), xlim = c(-77.5,-76.9), pch=".", col="grey")
> len = length(sort(table(dat$ACRS.Report.Type), decreasing = TRUE))
> color = rainbow(len)
> names = names(sort(table(dat$ACRS.Report.Type), decreasing = TRUE))
> for(i in 1:len){
+     points(Latitude ~ Longitude, data=dat[dat$ACRS.Report.Type==names[i],], col=color[i], pch=".") 
+ }
> points(Latitude ~ Longitude, data=dat[dat$ACRS.Report.Type=="Fatal Crash",], col=color[i], pch="x") 
> legend('topleft', title="Report Type",legend = names, col=color, pch=19, cex=0.6, pt.cex = 1, y.intersp = 1.5)

We are creating a visual map of where the accidents happened and what was the ACRS Report type. We used the same zoomed up map from the last graphto see where the three different ACRS Report types of fatal crashes, injury crashes, and property damage crashes. It appears that property damage crashes had the most accidents, then injury crashes, and the fatal crashes had the least. Most of the property damage crashes were mostly located not on the main roads, but it might have been neighborhoods. Furthermore, injury crashes mostly happened on the main roads of the county. We imagined that fatal crashes involved at least someone that could have been hospitalized, so we wanted to know where they happened and we marked them with an X. We found out that the majority of the fatal crashes also happened in the county. However, there was still a lot on the outskirts of the county.

How Weather may Effect Severity of Collision and Who is at Fault

> par(mfrow=c(3,2)) 
> fatal = names(table(dat$ACRS.Report.Type))
> for(i in 1:length(table(dat$ACRS.Report.Type))){
+   par(mar=c(4,9,3,3))
+   tb = table(dat[dat$ACRS.Report.Type==fatal[i] & dat$Driver.At.Fault=="Yes",]$Weather)
+   barplot(tb,horiz = T,las=2, cex.names = 0.5, main=paste(fatal[i], "Driver At Fault"), col = c("indianred", "lightsteelblue4", "orangered2", "seagreen3"))
+   par(mar=c(4,9,3,3))
+   tb = table(dat[dat$ACRS.Report.Type==fatal[i] & dat$Driver.At.Fault=="No",]$Weather)
+   barplot(tb,horiz = T,las=2, cex.names = 0.5, main=paste(fatal[i], "Driver Not At Fault"), col = c("indianred", "lightsteelblue4", "orangered2", "seagreen3"))
+ }

The graphs above show the different weather conditions that were present at the time for each severity level. Furthermore, we compared the difference between drivers at fault and drivers who were not at fault. When comparing the fatal crashes, it appears that clear and cloudy days had the most accidents for drivers at fault. For the case of drivers not at fault clear and rainy weather had the most accidents when excluding any N/A data. The next comparison is for injury crashes, and it was shocking that two graphs between drivers at fault and drivers not at fault look incredibility similar. The last comparison is about property damage crashes, and it appears they are quite identical for both drivers at fault and drivers not at fault. However, it seems that drivers at fault had more accidents than the others.

Accidents Caused by Locals or Visitors

> par(mar=c(2,3,2,3))
> name = names(table(dat$Drivers.License.State))
> maryland = "MD"
> others = c("AB", "AS", "BC", "IT", "MB", "MH", "NL", "NS", "NT", "ON", "PR", "QC", "SK", "UM", "US", "VI", "XX", "YT")
> states = as.vector(setdiff(setdiff(setdiff(name, others),maryland),""))
> barplot(c(sum(dat$Drivers.License.State %in% maryland), sum(dat$Drivers.License.State %in% states), 
+           sum(dat$Drivers.License.State %in% others)), names.arg =c("Local", "USA", "Others"), col = c("deepskyblue3", "lightblue3") )

We are looking at whether the accidents are caused by more locals living in Maryland, or different USA places or other countries. In the figure below it shows that a significant portion of the accidents are from the locals, not a very significant amount from drivers from other states and there are some out of country drivers that were involved in some accidents.

Information About the Agencies Completing Reports

> # Determine threshold for which columns to keep/discard: 
> percent.filled = apply(dat, 2, function(x) (mean(na.omit(x) != "")) * 100)
> 
> # Create DF with only columns at least 90% filled:
> agency.reports.df = dat[, percent.filled >= 90 ]
> 
> # add new column to DF of percentage filled fields per row:
> new.row = apply(agency.reports.df, 1, function(x) (mean(na.omit(x) != "")) * 100)
> agency.reports.df$Percent.Filled = new.row
> 
> # Determine average percentage filled fields per agency
> filled.by.agency = aggregate(Percent.Filled ~ Agency.Name, data = agency.reports.df, mean)
> 
> # Scale the percentages and add values as a new column
> scaled.row = scale(filled.by.agency$Percent.Filled)
> x = scaled.row[,1]
> filled.by.agency$Scaled.Values = x
> 
> # Barplot normalized values:
> par(mar=c(5.1,15,4.1,1))
> barplot(filled.by.agency$Scaled.Values, main = "Z-Score normalization of Percentage Complete Reports per Agency",
+           names.arg = filled.by.agency$Agency.Name, horiz = TRUE, las=1, cex.names = 1, 
+           col = c("skyblue4", "darkslategray2"), panel.first = grid(nx = NA, ny = NULL)) 

The Montgomery County Crash Data relies on the local law enforcement agencies to provide data that is accurate and as complete as possible. It is understandable that agents are unlikely to complete 100% of reports during the stressful occurences of dealing with many collisions. Still, it should be noted which agencies are more likely to fill out more information on reports and which agencies create reports that are missing important information.

Only the report fields with at least 90% of filled-in data was used. Also, only empty cells were considered to be un-filled entries. Any other entry, including “N/A”, were considered to be an entry completed by an agent of the corresponding agency.

The resulting barplot shows that the Rockville agency has the record of completing the most fields of a report, while Gaithersburg has the worst record. This may indicate that agents of Gaithersburg do not feel the need to fill out reports to the extent of other agencies. For more accurate data, it would be important for all agencies to complete reports to the best of their abilities for more accurate assessments. This information may be helpful for future data analysis as well. A data scientist may choose to work solely with reports compiled by one agency over another for a more fair analysis and data interpretation.

It should be noted that it is unclear if the agencies with similar names are in fact the same agencies (e.g. “Rockville” and “Rockville Police Departme”).

Conclusions

One of the greatest obstacles to creating visualizations is how to choose the most fitting graph type and how to process the data in a way that would be most efficiently usable. Creating new dataframes became a common occurance with these visualizations, which allowed for the trimming of unnecessary data and the filtering of missing/unusable data. Further filtering was creating by searching for certain values or ranges of values. Sometimes, there was a continuous need to change Longitude and Latitude boundaries to create a more informative plot that expressed the locations of accidents in a more presentable manner.

While working with this data, finding interesting correlations between data values proved challenging yet interesting. It was surprising to find that so much information goes into the necessary documentation of a collision. For example, the road conditions seem to be an obvious factor in many collisions, but the documentation of it proved to be a very important feature when investigating how influential a particular condition may be for a specific area. This was incredibly helpful information when trying to paint a picture of an environment.

The work of a data scientist seems to be creating a new environment out of an environment that has already existed. Helping to pinpoint certain features that may be influencing particular outcomes and helping others to identify solutions to problems identified. This project has been informative in that it has proven that anyone can find something interesting in the data and can also often find unlikely correlations between data factors.